Before we get started, let’s load in our required libraries and data.
library(tidyverse)
## ── Attaching packages ──────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ─────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
df <- read_csv('2019_Iowa_Liquor_Sales.csv')
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Invoice/Item Number` = col_character(),
## Date = col_character(),
## `Store Name` = col_character(),
## Address = col_character(),
## City = col_character(),
## `Store Location` = col_character(),
## County = col_character(),
## `Category Name` = col_character(),
## `Vendor Number` = col_character(),
## `Vendor Name` = col_character(),
## `Item Description` = col_character()
## )
## See spec(...) for full column specifications.
df <- na.omit(df)
Great! Now let’s do some preliminary data cleaning. We’ll extract the latitude and longitude from the Store Location, make sure our variables are the correct types, and extract the day, month, and year from Date.
df$temp <- gsub('[()]', '', substr(df$`Store Location`, 7, 10000))
df <- df %>%
separate(temp, c('Longitude', 'Latitude'), convert = TRUE, sep=' ')
df <- df %>%
separate(Date, c('Month', 'Day', 'Year'), convert = TRUE, remove = FALSE)
df$Date <- as.Date(df$Date, '%m/%d/%Y')
Now that we have that done, let’s plot a scatterplot of the liquor store locations.
ggplot(df, aes(Longitude, Latitude)) + geom_point() +
ggtitle('Latitude and Longitude of Liquor Stores in Ames')
Although there’s no map of Ames, we can sort of make out the liquor stores with our knowledge of the city. Around (42.03, -93.62) we can see a somewhat horseshoe-shaped group of liquor stores. Intuition tells us that this is likely Welch Avenue and surrounding areas, with the amount and shape of the stores on the map.
Now let’s break down the liquor categories and see the volume sold.
ggplot(df, aes(x = `Category Name`, weight = `Volume Sold (Gallons)`)) +
geom_bar() +
xlab('Liquor Category') +
ylab('Volume of Liquor Sold in Gallons') +
ggtitle('Volume of Liquor Sold in Ames by Category') +
coord_flip()
American Vodka is by far the most liquor category sold in Ames by almost 20,000 gallons. The next highest is Canadian Whiskies.
A large portion of American vodka is likely attributed to Hawkeye, and a large portion of Canadian whiskey is likely attributed to Black Velvet.
Now let’s summarize the daily liquor sales in Ames during 2019.
df2 <- df %>%
group_by(Date, Day, Month, Year) %>%
summarize(sales = n(), volume_l = sum(`Volume Sold (Liters)`),
volume_gal = sum(`Volume Sold (Gallons)`),
money_spent = sum(`Sale (Dollars)`))
df2
## # A tibble: 146 x 8
## # Groups: Date, Day, Month [146]
## Date Day Month Year sales volume_l volume_gal money_spent
## <date> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 2019-01-03 3 1 2019 767 4944. 1302. 83304.
## 2 2019-01-04 4 1 2019 143 1256. 331. 16766.
## 3 2019-01-09 9 1 2019 701 5616. 1480. 82540.
## 4 2019-01-10 10 1 2019 2 135 35.6 1457.
## 5 2019-01-11 11 1 2019 153 1038. 273. 14427.
## 6 2019-01-16 16 1 2019 632 6748. 1779 95180.
## 7 2019-01-24 24 1 2019 900 7167. 1888. 105730.
## 8 2019-01-25 25 1 2019 253 2329. 614. 31893.
## 9 2019-01-30 30 1 2019 795 7132. 1880. 103468.
## 10 2019-01-31 31 1 2019 18 154. 40.7 2223
## # … with 136 more rows
Great! Now that we have that done, let’s plot volume sold by day and facet by month. This will help us see any trends or patterns.
df2 %>% ggplot(aes(x = Day, y = volume_gal)) +
geom_jitter() +
geom_smooth() +
facet_wrap(~Month) +
xlab('Day of Month') +
ylab('Volume Sold in Gallons') +
ggtitle('Volume of Alcohol Sold by Day During 2019 in Ames')
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
We can see a couple interesting things looking at this graph. First, there’s a hard drop off at the end of April—this could be due to exams starting and students focusing more on their education.
There’s a fairly sharp increase at the end of July. Perhaps this is due to students coming back to Ames. There is also a slight increase at the end of August, perhaps this is because students want one last “hoorah” before classes resume.
Now let’s see if there’s a pattern for the ISU home football games in 2019. We’ll do this by plotting a time series of the volume of liquor sold. Since football starts in August, we’ll start the graph off in August and end it before December, since that’s all we need. The other months aren’t relevant. We can assume that there will be heavy spikes on the home football days, so we will label the days with over 1500 gallons sold.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
p <- df2 %>%
filter(Month >= 8 & Month < 12) %>%
ggplot(aes(x = Date, y = volume_gal)) +
geom_line() +
geom_text(aes(x = Date, y = volume_gal, label = Date), data = df2 %>% filter(Month >= 8 & Month < 12 & volume_gal >= 1750))
ggplotly(p)
According to Cyclones.com, the home games are as follows:
Not every game spikes higher than 1750 gallons, but there is still a noticeable trend. In fact, liquor sales seem to spike 2-3 days before the game, which is fairly interesting—perhaps people are afraid of the stores selling out so they buy it beforehand?
Regardless, there is a noticeable trend with home games and liquor store sales spiking upwards.